{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RYmPh1qB_KO2"
      },
      "source": [
        "##### Copyright 2020 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "oMRm3czy9tLh"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ooXoR4kx_YL9"
      },
      "source": [
        "# TF Lattice 聚合函数模型"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BR6XNYEXEgSU"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td><a target=\"_blank\" href=\"https://tensorflow.google.cn/lattice/tutorials/aggregate_function_models\">     <img src=\"https://tensorflow.google.cn/images/tf_logo_32px.png\">     在 TensorFlow.org 上查看</a></td>\n",
        "  <td><a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/zh-cn/lattice/tutorials/aggregate_function_models.ipynb\"><img src=\"https://tensorflow.google.cn/images/colab_logo_32px.png\">在 Google Colab 中运行 </a></td>\n",
        "  <td><a target=\"_blank\" href=\"https://github.com/tensorflow/docs-l10n/blob/master/site/zh-cn/lattice/tutorials/aggregate_function_models.ipynb\"><img src=\"https://tensorflow.google.cn/images/GitHub-Mark-32px.png\">在 GitHub 中查看源代码</a></td>\n",
        "  <td><a href=\"https://storage.googleapis.com/tensorflow_docs/docs-l10n/site/zh-cn/lattice/tutorials/aggregate_function_models.ipynb\"><img src=\"https://tensorflow.google.cn/images/download_logo_32px.png\"> 下载笔记本</a></td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-ZfQWUmfEsyZ"
      },
      "source": [
        "## 概述\n",
        "\n",
        "利用 TFL 预制聚合函数模型，您可以快速轻松地构建 TFL `tf.keras.model` 实例来学习复杂聚合函数。本指南概述了构造 TFL 预制聚合函数模型并对其进行训练/测试所需的步骤。 "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "L0lgWoB6Gmk1"
      },
      "source": [
        "## 设置\n",
        "\n",
        "安装 TF Lattice 软件包："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ivwKrEdLGphZ"
      },
      "outputs": [],
      "source": [
        "#@test {\"skip\": true}\n",
        "!pip install tensorflow-lattice pydot"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VQsRKS4wGrMu"
      },
      "source": [
        "导入所需的软件包："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "j41-kd4MGtDS"
      },
      "outputs": [],
      "source": [
        "import tensorflow as tf\n",
        "\n",
        "import collections\n",
        "import logging\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import sys\n",
        "import tensorflow_lattice as tfl\n",
        "logging.disable(sys.maxsize)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZHPohKjBIFG5"
      },
      "source": [
        "下载 Puzzles 数据集："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VjYHpw2dSfHH"
      },
      "outputs": [],
      "source": [
        "train_dataframe = pd.read_csv(\n",
        "    'https://raw.githubusercontent.com/wbakst/puzzles_data/master/train.csv')\n",
        "train_dataframe.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UOsgu3eIEur6"
      },
      "outputs": [],
      "source": [
        "test_dataframe = pd.read_csv(\n",
        "    'https://raw.githubusercontent.com/wbakst/puzzles_data/master/test.csv')\n",
        "test_dataframe.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XG7MPCyzVr22"
      },
      "source": [
        "提取并转换特征和标签"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bYdJicq5bBuz"
      },
      "outputs": [],
      "source": [
        "# Features:\n",
        "# - star_rating       rating out of 5 stars (1-5)\n",
        "# - word_count        number of words in the review\n",
        "# - is_amazon         1 = reviewed on amazon; 0 = reviewed on artifact website\n",
        "# - includes_photo    if the review includes a photo of the puzzle\n",
        "# - num_helpful       number of people that found this review helpful\n",
        "# - num_reviews       total number of reviews for this puzzle (we construct)\n",
        "#\n",
        "# This ordering of feature names will be the exact same order that we construct\n",
        "# our model to expect.\n",
        "feature_names = [\n",
        "    'star_rating', 'word_count', 'is_amazon', 'includes_photo', 'num_helpful',\n",
        "    'num_reviews'\n",
        "]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kx0ZX2HR-4qb"
      },
      "outputs": [],
      "source": [
        "def extract_features(dataframe, label_name):\n",
        "  # First we extract flattened features.\n",
        "  flattened_features = {\n",
        "      feature_name: dataframe[feature_name].values.astype(float)\n",
        "      for feature_name in feature_names[:-1]\n",
        "  }\n",
        "\n",
        "  # Construct mapping from puzzle name to feature.\n",
        "  star_rating = collections.defaultdict(list)\n",
        "  word_count = collections.defaultdict(list)\n",
        "  is_amazon = collections.defaultdict(list)\n",
        "  includes_photo = collections.defaultdict(list)\n",
        "  num_helpful = collections.defaultdict(list)\n",
        "  labels = {}\n",
        "\n",
        "  # Extract each review.\n",
        "  for i in range(len(dataframe)):\n",
        "    row = dataframe.iloc[i]\n",
        "    puzzle_name = row['puzzle_name']\n",
        "    star_rating[puzzle_name].append(float(row['star_rating']))\n",
        "    word_count[puzzle_name].append(float(row['word_count']))\n",
        "    is_amazon[puzzle_name].append(float(row['is_amazon']))\n",
        "    includes_photo[puzzle_name].append(float(row['includes_photo']))\n",
        "    num_helpful[puzzle_name].append(float(row['num_helpful']))\n",
        "    labels[puzzle_name] = float(row[label_name])\n",
        "\n",
        "  # Organize data into list of list of features.\n",
        "  names = list(star_rating.keys())\n",
        "  star_rating = [star_rating[name] for name in names]\n",
        "  word_count = [word_count[name] for name in names]\n",
        "  is_amazon = [is_amazon[name] for name in names]\n",
        "  includes_photo = [includes_photo[name] for name in names]\n",
        "  num_helpful = [num_helpful[name] for name in names]\n",
        "  num_reviews = [[len(ratings)] * len(ratings) for ratings in star_rating]\n",
        "  labels = [labels[name] for name in names]\n",
        "\n",
        "  # Flatten num_reviews\n",
        "  flattened_features['num_reviews'] = [len(reviews) for reviews in num_reviews]\n",
        "\n",
        "  # Convert data into ragged tensors.\n",
        "  star_rating = tf.ragged.constant(star_rating)\n",
        "  word_count = tf.ragged.constant(word_count)\n",
        "  is_amazon = tf.ragged.constant(is_amazon)\n",
        "  includes_photo = tf.ragged.constant(includes_photo)\n",
        "  num_helpful = tf.ragged.constant(num_helpful)\n",
        "  num_reviews = tf.ragged.constant(num_reviews)\n",
        "  labels = tf.constant(labels)\n",
        "\n",
        "  # Now we can return our extracted data.\n",
        "  return (star_rating, word_count, is_amazon, includes_photo, num_helpful,\n",
        "          num_reviews), labels, flattened_features"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Nd6j_J5CbNiz"
      },
      "outputs": [],
      "source": [
        "train_xs, train_ys, flattened_features = extract_features(train_dataframe, 'Sales12-18MonthsAgo')\n",
        "test_xs, test_ys, _ = extract_features(test_dataframe, 'SalesLastSixMonths')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "KfHHhCRsHejl"
      },
      "outputs": [],
      "source": [
        "# Let's define our label minimum and maximum.\n",
        "min_label, max_label = float(np.min(train_ys)), float(np.max(train_ys))\n",
        "min_label, max_label = float(np.min(train_ys)), float(np.max(train_ys))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9TwqlRirIhAq"
      },
      "source": [
        "设置用于在本指南中进行训练的默认值："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GckmXFzRIhdD"
      },
      "outputs": [],
      "source": [
        "LEARNING_RATE = 0.1\n",
        "BATCH_SIZE = 128\n",
        "NUM_EPOCHS = 500\n",
        "MIDDLE_DIM = 3\n",
        "MIDDLE_LATTICE_SIZE = 2\n",
        "MIDDLE_KEYPOINTS = 16\n",
        "OUTPUT_KEYPOINTS = 8"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TpDKon4oIh2W"
      },
      "source": [
        "## 特征配置\n",
        "\n",
        "使用 [tfl.configs.FeatureConfig](https://tensorflow.google.cn/lattice/api_docs/python/tfl/configs/FeatureConfig) 设置特征校准和按特征的配置。特征配置包括单调性约束、按特征的正则化（请参阅 [tfl.configs.RegularizerConfig](https://tensorflow.google.cn/lattice/api_docs/python/tfl/configs/RegularizerConfig)）以及点阵模型的点阵大小。\n",
        "\n",
        "请注意，我们必须为希望模型识别的任何特征完全指定特征配置。否则，模型将无法获知存在这样的特征。对于聚合模型，将自动考虑这些特征并将其处理为不规则特征。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_IMwcDh7Xs5n"
      },
      "source": [
        "### 计算分位数\n",
        "\n",
        "尽管 `tfl.configs.FeatureConfig` 中 `pwl_calibration_input_keypoints` 的默认设置为“分位数”，但对于预制模型，我们必须手动定义输入关键点。为此，我们首先定义自己的辅助函数来计算分位数。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "l0uYl9ZpXtW1"
      },
      "outputs": [],
      "source": [
        "def compute_quantiles(features,\n",
        "                      num_keypoints=10,\n",
        "                      clip_min=None,\n",
        "                      clip_max=None,\n",
        "                      missing_value=None):\n",
        "  # Clip min and max if desired.\n",
        "  if clip_min is not None:\n",
        "    features = np.maximum(features, clip_min)\n",
        "    features = np.append(features, clip_min)\n",
        "  if clip_max is not None:\n",
        "    features = np.minimum(features, clip_max)\n",
        "    features = np.append(features, clip_max)\n",
        "  # Make features unique.\n",
        "  unique_features = np.unique(features)\n",
        "  # Remove missing values if specified.\n",
        "  if missing_value is not None:\n",
        "    unique_features = np.delete(unique_features,\n",
        "                                np.where(unique_features == missing_value))\n",
        "  # Compute and return quantiles over unique non-missing feature values.\n",
        "  return np.quantile(\n",
        "      unique_features,\n",
        "      np.linspace(0., 1., num=num_keypoints),\n",
        "      interpolation='nearest').astype(float)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9oYZdVeWEhf2"
      },
      "source": [
        "### 定义我们的特征配置\n",
        "\n",
        "现在我们可以计算分位数了，我们为希望模型将其作为输入的每个特征定义一个特征配置。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rEYlSXhTEmoh"
      },
      "outputs": [],
      "source": [
        "# Feature configs are used to specify how each feature is calibrated and used.\n",
        "feature_configs = [\n",
        "    tfl.configs.FeatureConfig(\n",
        "        name='star_rating',\n",
        "        lattice_size=2,\n",
        "        monotonicity='increasing',\n",
        "        pwl_calibration_num_keypoints=5,\n",
        "        pwl_calibration_input_keypoints=compute_quantiles(\n",
        "            flattened_features['star_rating'], num_keypoints=5),\n",
        "    ),\n",
        "    tfl.configs.FeatureConfig(\n",
        "        name='word_count',\n",
        "        lattice_size=2,\n",
        "        monotonicity='increasing',\n",
        "        pwl_calibration_num_keypoints=5,\n",
        "        pwl_calibration_input_keypoints=compute_quantiles(\n",
        "            flattened_features['word_count'], num_keypoints=5),\n",
        "    ),\n",
        "    tfl.configs.FeatureConfig(\n",
        "        name='is_amazon',\n",
        "        lattice_size=2,\n",
        "        num_buckets=2,\n",
        "    ),\n",
        "    tfl.configs.FeatureConfig(\n",
        "        name='includes_photo',\n",
        "        lattice_size=2,\n",
        "        num_buckets=2,\n",
        "    ),\n",
        "    tfl.configs.FeatureConfig(\n",
        "        name='num_helpful',\n",
        "        lattice_size=2,\n",
        "        monotonicity='increasing',\n",
        "        pwl_calibration_num_keypoints=5,\n",
        "        pwl_calibration_input_keypoints=compute_quantiles(\n",
        "            flattened_features['num_helpful'], num_keypoints=5),\n",
        "        # Larger num_helpful indicating more trust in star_rating.\n",
        "        reflects_trust_in=[\n",
        "            tfl.configs.TrustConfig(\n",
        "                feature_name=\"star_rating\", trust_type=\"trapezoid\"),\n",
        "        ],\n",
        "    ),\n",
        "    tfl.configs.FeatureConfig(\n",
        "        name='num_reviews',\n",
        "        lattice_size=2,\n",
        "        monotonicity='increasing',\n",
        "        pwl_calibration_num_keypoints=5,\n",
        "        pwl_calibration_input_keypoints=compute_quantiles(\n",
        "            flattened_features['num_reviews'], num_keypoints=5),\n",
        "    )\n",
        "]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9zoPJRBvPdcH"
      },
      "source": [
        "## 聚合函数模型\n",
        "\n",
        "要构造 TFL 预制模型，首先从 [tfl.configs](https://tensorflow.google.cn/lattice/api_docs/python/tfl/configs) 构造模型配置。使用 [tfl.configs.AggregateFunctionConfig](https://tensorflow.google.cn/lattice/api_docs/python/tfl/configs/AggregateFunctionConfig) 构造聚合函数模型。它会先应用分段线性和分类校准，随后再将点阵模型应用于不规则输入的每个维度。然后，它会在每个维度的输出上应用聚合层。接下来，会应用可选的输出分段线性校准。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "l_4J7EjSPiP3"
      },
      "outputs": [],
      "source": [
        "# Model config defines the model structure for the aggregate function model.\n",
        "aggregate_function_model_config = tfl.configs.AggregateFunctionConfig(\n",
        "    feature_configs=feature_configs,\n",
        "    middle_dimension=MIDDLE_DIM,\n",
        "    middle_lattice_size=MIDDLE_LATTICE_SIZE,\n",
        "    middle_calibration=True,\n",
        "    middle_calibration_num_keypoints=MIDDLE_KEYPOINTS,\n",
        "    middle_monotonicity='increasing',\n",
        "    output_min=min_label,\n",
        "    output_max=max_label,\n",
        "    output_calibration=True,\n",
        "    output_calibration_num_keypoints=OUTPUT_KEYPOINTS,\n",
        "    output_initialization=np.linspace(\n",
        "        min_label, max_label, num=OUTPUT_KEYPOINTS))\n",
        "# An AggregateFunction premade model constructed from the given model config.\n",
        "aggregate_function_model = tfl.premade.AggregateFunction(\n",
        "    aggregate_function_model_config)\n",
        "# Let's plot our model.\n",
        "tf.keras.utils.plot_model(\n",
        "    aggregate_function_model, show_layer_names=False, rankdir='LR')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4F7AwiXgWhe2"
      },
      "source": [
        "每个聚合层的输出是已校准点阵在不规则输入上的平均输出。下面是在第一个聚合层内使用的模型："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UM7XF6UIWo4T"
      },
      "outputs": [],
      "source": [
        "aggregation_layers = [\n",
        "    layer for layer in aggregate_function_model.layers\n",
        "    if isinstance(layer, tfl.layers.Aggregation)\n",
        "]\n",
        "tf.keras.utils.plot_model(\n",
        "    aggregation_layers[0].model, show_layer_names=False, rankdir='LR')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0ohYOftgTZhq"
      },
      "source": [
        "现在，与任何其他 [tf.keras.Model](https://tensorflow.google.cn/api_docs/python/tf/keras/Model) 一样，我们编译该模型并将其拟合到我们的数据中。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "uB9di3-lTfMy"
      },
      "outputs": [],
      "source": [
        "aggregate_function_model.compile(\n",
        "    loss='mae',\n",
        "    optimizer=tf.keras.optimizers.Adam(LEARNING_RATE))\n",
        "aggregate_function_model.fit(\n",
        "    train_xs, train_ys, epochs=NUM_EPOCHS, batch_size=BATCH_SIZE, verbose=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pwZtGDR-Tzur"
      },
      "source": [
        "训练完模型后，我们可以在测试集中对其进行评估。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RWj1YfubT0NE"
      },
      "outputs": [],
      "source": [
        "print('Test Set Evaluation...')\n",
        "print(aggregate_function_model.evaluate(test_xs, test_ys))"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "aggregate_function_models.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
